
Optimize traverse #4498

Merged: 31 commits merged into typelevel:main from optimize-traverse on May 1, 2024

Conversation

@TimWSpence (Member) commented Aug 23, 2023

This addresses the StackSafeMonad part of #4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.

Open questions (will edit these out of the description later):

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

The following benchmarks were run on an AWS t3.xlarge:

Baseline (b08196e)

Benchmark (length) Mode Cnt Score Error Units
TraverseBench.filterList 10000 thrpt 20 3247.486 ± 270.817 ops/s
TraverseBench.filterVector 10000 thrpt 20 3351.585 ± 255.883 ops/s
TraverseBench.mapList 10000 thrpt 20 3198.905 ± 235.865 ops/s
TraverseBench.mapVector 10000 thrpt 20 3906.339 ± 225.508 ops/s
TraverseBench.traverseChain 10000 thrpt 20 556.706 ± 31.814 ops/s
TraverseBench.traverseChainError 10000 thrpt 20 1278.954 ± 74.242 ops/s
TraverseBench.traverseFilterChain 10000 thrpt 20 536.841 ± 38.696 ops/s
TraverseBench.traverseFilterList 10000 thrpt 20 532.103 ± 30.473 ops/s
TraverseBench.traverseFilterVector 10000 thrpt 20 590.312 ± 29.534 ops/s
TraverseBench.traverseList 10000 thrpt 20 537.023 ± 33.287 ops/s
TraverseBench.traverseListError 10000 thrpt 20 1674.930 ± 48.771 ops/s
TraverseBench.traverseVector 10000 thrpt 20 581.844 ± 32.074 ops/s
TraverseBench.traverseVectorError 10000 thrpt 20 1778.334 ± 135.859 ops/s
TraverseBench.traverse_Chain 10000 thrpt 20 490.822 ± 14.145 ops/s
TraverseBench.traverse_List 10000 thrpt 20 474.568 ± 28.995 ops/s
TraverseBench.traverse_Vector 10000 thrpt 20 622.924 ± 38.002 ops/s

Chain (2d5f4d7)

Benchmark (length) Mode Cnt Score Error Units
TraverseBench.filterList 10000 thrpt 20 3170.974 ± 146.628 ops/s
TraverseBench.filterVector 10000 thrpt 20 3217.881 ± 181.087 ops/s
TraverseBench.mapList 10000 thrpt 20 3443.969 ± 104.117 ops/s
TraverseBench.mapVector 10000 thrpt 20 3827.401 ± 148.074 ops/s
TraverseBench.traverseChain 10000 thrpt 20 784.845 ± 25.990 ops/s
TraverseBench.traverseChainError 10000 thrpt 20 1302.782 ± 62.342 ops/s
TraverseBench.traverseFilterChain 10000 thrpt 20 748.986 ± 62.151 ops/s
TraverseBench.traverseFilterList 10000 thrpt 20 669.817 ± 41.798 ops/s
TraverseBench.traverseFilterVector 10000 thrpt 20 618.644 ± 28.558 ops/s
TraverseBench.traverseList 10000 thrpt 20 583.413 ± 27.079 ops/s
TraverseBench.traverseListError 10000 thrpt 20 1253.510 ± 76.328 ops/s
TraverseBench.traverseVector 10000 thrpt 20 581.975 ± 27.619 ops/s
TraverseBench.traverseVectorError 10000 thrpt 20 1302.362 ± 66.015 ops/s
TraverseBench.traverse_Chain 10000 thrpt 20 3598.033 ± 108.420 ops/s
TraverseBench.traverse_List 10000 thrpt 20 4048.287 ± 158.090 ops/s
TraverseBench.traverse_Vector 10000 thrpt 20 4094.024 ± 101.115 ops/s

Vector (702ab8b)

Benchmark (length) Mode Cnt Score Error Units
TraverseBench.filterList 10000 thrpt 20 3305.299 ± 233.483 ops/s
TraverseBench.filterVector 10000 thrpt 20 3253.531 ± 133.993 ops/s
TraverseBench.mapList 10000 thrpt 20 3308.961 ± 138.697 ops/s
TraverseBench.mapVector 10000 thrpt 20 3911.187 ± 142.582 ops/s
TraverseBench.traverseChain 10000 thrpt 20 594.535 ± 29.777 ops/s
TraverseBench.traverseChainError 10000 thrpt 20 1084.548 ± 75.686 ops/s
TraverseBench.traverseFilterChain 10000 thrpt 20 642.895 ± 55.621 ops/s
TraverseBench.traverseFilterList 10000 thrpt 20 682.965 ± 41.234 ops/s
TraverseBench.traverseFilterVector 10000 thrpt 20 696.235 ± 59.945 ops/s
TraverseBench.traverseList 10000 thrpt 20 582.275 ± 35.877 ops/s
TraverseBench.traverseListError 10000 thrpt 20 1114.959 ± 52.641 ops/s
TraverseBench.traverseVector 10000 thrpt 20 638.235 ± 42.533 ops/s
TraverseBench.traverseVectorError 10000 thrpt 20 1098.038 ± 62.946 ops/s
TraverseBench.traverse_Chain 10000 thrpt 20 3480.086 ± 178.654 ops/s
TraverseBench.traverse_List 10000 thrpt 20 3950.650 ± 171.441 ops/s
TraverseBench.traverse_Vector 10000 thrpt 20 3695.495 ± 182.401 ops/s

@armanbilge (Member) commented Aug 23, 2023

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

Can we add some new laws in terms of OptionT[Eval, _]? Edit: I don't think this will work unless we make sure that OptionT offers a StackSafeMonad if F is a StackSafeMonad.
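A rough sketch of what such an instance might look like (an assumption about how it could be done, not actual cats code; assumes the kind-projector plugin for the `*` syntax):

import cats.StackSafeMonad
import cats.data.OptionT

// Hypothetical instance: OptionT preserves stack safety when F is a
// StackSafeMonad, since OptionT's flatMap just delegates to F.flatMap.
def optionTStackSafe[F[_]](implicit F: StackSafeMonad[F]): StackSafeMonad[OptionT[F, *]] =
  new StackSafeMonad[OptionT[F, *]] {
    def pure[A](a: A): OptionT[F, A] = OptionT.pure[F](a)
    def flatMap[A, B](fa: OptionT[F, A])(f: A => OptionT[F, B]): OptionT[F, B] =
      fa.flatMap(f) // stack safety is inherited from F's flatMap
  }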

Should I open a separate PR for the new benchmarks first?

Nah.

@johnynek (Contributor)

Can you post the results of the benchmarks? Apologies if I missed them.

val iter = fa.iterator
if (iter.hasNext) {
val first = iter.next()
G.map(iter.foldLeft(f(first)) { case (g, a) =>
Contributor:

I think this can be G.void(...) so that it has a chance for an optimized implementation.

Member Author:

Yes, good catch!

G.map(fa.iterator.foldLeft(G.pure(builder)) { case (bldrG, a) =>
G.flatMap(bldrG) { bldr =>
G.map(f(a)) {
case Some(b) => bldr += b
Contributor:

I can't see why we know this is safe or lawful.

We have a mutable data structure that could potentially be in a multithreaded situation with G.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.

Member:

We have a mutable data structure that could potentially be in a multithreaded situation with G.

It can't be: every append to the builder happens in flatMap so by law it must be sequential, not concurrent/parallel.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

Hmm. I'm not entirely sure how this applies to traverse. It doesn't have a notion of "recovery". For sure, each individual step that the traverse runs may have a notion of recovery, but that's just it: it will either succeed or it will fail. But there's no way to "recover" an intermediate structure from the traverse itself.

I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.

Won't disagree here. We could do a benchmark to see how much performance we are leaving on the table with that strategy.

@TimWSpence (Member Author)

Can you post the results of the benchmarks? Apologies if I missed them.

Yes, will do. I had originally asked whether I should add the new benchmarks in a separate PR first, which is why I didn't run them straight away.

@TimWSpence (Member Author)

@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?

Those traverse_ improvements though 😆

Commit: "Is this necessary? The monad laws should ensure that it's safe to use mutable builders. Nonetheless it will be good to confirm the performance delta for using immutable data structures"
@johnynek (Contributor) commented Nov 8, 2023

Thanks for working on this. What I don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that case you can have laziness, concurrency, and more than one output per input.

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

@armanbilge (Member)

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

private[cats] def traverseDirectly[G[_], A, B](
fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (accG, a) =>
Member:

I think we can do even better here by using a Vector.newBuilder, we can do even even better by using a dedicated builder for the desired collection (e.g. ArraySeq or List) and we can do even even even better by providing a size hint to the builder.

Reply:

I might have misunderstood something, but it looks like that's how things were done before the commit 8cc742e was introduced, which added the immutable data structures. 🙂

Member:

You're right :) let's just say this PR has been open a while and I lost track of the old stuff 😇
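For illustration, the builder-plus-size-hint idea might look roughly like this (a sketch only, with hypothetical names, and subject to the same mutability questions discussed above):

import cats.StackSafeMonad

// Sketch: thread a mutable builder through flatMap and give it a size hint
// so it can allocate once up front.
def traverseDirectlyBuilder[G[_], A, B](
    fa: Iterable[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
  val bldr = Vector.newBuilder[B]
  bldr.sizeHint(fa.size) // single allocation when the size is known up front
  G.map(
    fa.iterator.foldLeft(G.pure(())) { (acc, a) =>
      // each append is sequenced through flatMap
      G.flatMap(acc)(_ => G.map(f(a))(b => { bldr += b; () }))
    }
  )(_ => bldr.result())
}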

@johnynek (Contributor) commented Nov 9, 2023

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

This is definitely a colloquial understanding of Monads, that they represent a sequential computation, but I don't see the law that allows us to make the claim you want.

For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the code as currently written will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may.

For instance, consider this type:

import cats.Functor
import cats.syntax.functor._

sealed trait Tree[F[_], A]
case class Item[F[_], A](a: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[F[_], A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]

object Tree {
  def pure[F[_], A](a: A): Tree[F, A] = Item(a)
  def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
    fa match {
      case Item(a)        => fn(a)
      case InsideF(treeF) => InsideF(treeF.map(flatMap(_)(fn)))
      case Branch(left, right) =>
        Branch(flatMap(left)(fn), flatMap(right)(fn))
    }
}

Now, I haven't checked if this follows the monad laws as we have expressed them, but I think it or something like it could. Note that in this picture we are recursively calling flatMap both inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed.

Couldn't something like this cause a problem?

I think saying "monads are sequential" is a bit too hand-wavey to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.
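To make the laziness point concrete, here is a rough sketch using Eval as the lazy F (illustrative only, building on the Tree definition above):

import cats.Eval

// With a lazy F such as Eval, flatMap on an InsideF node returns immediately:
// the inner tree is only visited when the suspended Eval is later forced.
val t: Tree[Eval, Int] = InsideF(Eval.later(Item[Eval, Int](1)))
val mapped: Tree[Eval, Int] = Tree.flatMap(t)(i => Item[Eval, Int](i + 1))
// `mapped` was constructed without running the Eval, so flatMap returned
// before fully recursing; the "sequential" intuition is not automatic here.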

Commit: "To see if this is more performant than an immutable vector"
if (iter.hasNext) {
val first = iter.next()
G.void(iter.foldLeft(f(first)) { case (g, a) =>
G.flatMap(g) { _ =>
Contributor:

Why not G.productR(g, f(a)) here? Is it to avoid calling f when g may have already failed? I think a comment would be helpful.

Member Author:

There was no good reason 😅 Thanks, I'll change it.

Member:

Is it to avoid calling f when g may have already failed?

Maybe this is something we should consider though?

Edit: ahh, I see your comment on f95b087. Fair enough.

fa: IterableOnce[A]
)(f: A => G[Option[B]])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (bldrG, a) =>
G.flatMap(bldrG) { acc =>
Contributor:

Add a comment on why we didn't write:

G.map2(bldrG, f(a)) {
  case (acc, Some(b)) => acc :+ b
  case (acc, _) => acc
}

Member Author:

Thanks, good point 👍

fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] = {
G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { case (accG, a) =>
G.flatMap(accG) { acc =>
Contributor:

Add a comment on why not G.map2(accG, f(a))(_ :: _).

Since we know G is lazy (a StackSafeMonad has to be, I think), I'm not sure of the downside here. One answer would be that we don't have to call f if we have already failed in a short-circuiting monad, but we are still iterating the whole list, so we are doing O(N) work regardless. The extra allocation doesn't seem like a big problem, since we already have to allocate the function passed to flatMap in the current case.

By calling map2 we are at least communicating to G what we are doing, and in principle some monads could optimize this (e.g. a Parser can make a more optimized map2 than flatMap, and it can also be stack safe since it runs lazily only when input is passed to the resulting parser).
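As a concrete sketch of the map2 variant being suggested here (illustrative names, not the PR's code):

import cats.StackSafeMonad

// Accumulate with map2 instead of flatMap: G can see the shape of the
// computation and could, in principle, optimize it (as a Parser might).
def traverseDirectlyViaMap2[G[_], A, B](
    fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] =
  G.map(
    fa.iterator.foldLeft(G.pure(List.empty[B])) { (accG, a) =>
      G.map2(accG, f(a))((acc, b) => b :: acc) // prepend; reverse at the end
    }
  )(_.reverse)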

@TimWSpence (Member Author)

@valencik absolute legend ❤️ Couple of points:

  1. I'm dumb and not thinking straight 😅 I was thinking the Map implementation would depend on either Chain or Vector, so I was waiting for the outcome of that debate. I'll fix it now.
  2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...
  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other.

@TimWSpence (Member Author)

@valencik sorry to bother you. Just tagging you in case you missed my previous comment

@valencik (Member)

2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other.

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

@TimWSpence (Member Author)

  2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other.

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now

@TimWSpence marked this pull request as ready for review March 21, 2024 16:49
@TimWSpence (Member Author) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

@valencik (Member) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

Just to double check I understand, do you mean the implementation of traverseDirectly?

Yeah, let's switch that to Chain and I can rerun benchmarks.

@valencik (Member)

Currently running benchmarks again on a tweak which uses Chain in traverseDirectly and traverseFilterDirectly
https://github.com/typelevel/cats/compare/803107de2f728f648800513ef5cff242a1172fbf...valencik:cats:more-chain?expand=1

@valencik (Member)

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence
TimWSpence#2

Feel free to leave it for a day or two.
There's a ton of numbers here.
I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements.
But again, maybe we should save that for a follow up.

@TimWSpence (Member Author)

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence TimWSpence#2

Feel free to leave it for a day or two. There's a ton of numbers here. I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements. But again, maybe we should save that for a follow up.

Thanks so much again! Presumably that PR is the same as the Chain version that I've had in place multiple times in the history of this PR? 😆

Sorry, how is traverseDirectlyVector different?

Commit: "Use `Chain` in `traverseDirectly` helpers"
@valencik (Member)

Before we get into the weeds, my position remains that we should merge this PR if we are ok with the small regression to traverseError benchmarks.

@djspiewak Pinging you as the creator of these benchmarks, what do you think here? We've improved or held stable on all fronts except the traverseError benchmarks.
Which, as a reminder, look like:

  @Benchmark
  def traverseVectorError(bh: Blackhole) = {
    val result = vectorT.traverse(vector) { i =>
      Eval.later {
        Blackhole.consumeCPU(Work) // simulate a fixed cost per element

        if (i == length * 0.3) {
          throw Failure // short-circuit 30% of the way through the input
        }

        i * 2
      }
    }

    try {
      bh.consume(result.value) // forcing the Eval is what runs the traversal
    } catch {
      case Failure => ()
    }
  }
Visual summary of benchmark results (main vs this PR): [JMH Visualizer screenshot]

My gut says these improvements are worth it and we should merge.


Sorry, how is traverseDirectlyVector different?

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

The commit: efa7ec0

Visual summary of benchmark results (this PR vs traverseDirectlyVector): [JMH Visualizer screenshot]

@TimWSpence (Member Author) commented Mar 26, 2024

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

Ah cool, thanks. Yeah I had asked about that before and I think we decided to leave that for potential follow-up

@TimWSpence (Member Author)

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

@valencik (Member) commented Apr 6, 2024

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

We need to make a call on whether or not we're ok with the small regression in the traverseError micro benchmarks.
Otherwise, I think we're good.

val (leftL, rightL) = fa.splitAt(leftSize)
runHalf(leftSize, leftL)
.flatMap { left =>
val right = runHalf(rightSize, rightL)
Contributor:

An interesting benchmark would be to do Eval.defer(runHalf(rightSize, rightL)), which would short-circuit for applicatives G that can short-circuit.

I wonder how much that would hurt the non-short-circuiting case... this is not for this PR, just something we may come back to in a later PR.

val rightSize = size - leftSize
runHalf(leftSize, idx)
.flatMap { left =>
val right = runHalf(rightSize, idx + leftSize)
Contributor:

Same comment about the possibility to short-circuit the right with Eval.defer(runHalf(rightSize, idx + leftSize)), but that would add cost to the cases where we do need to evaluate everything, which is probably the most common case.

@johnynek (Contributor) commented Apr 26, 2024

So my read based on: #4498 (comment)

is that for a benchmark on Vector with a stack-safe monad (Eval) that short-circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect Eval isn't designed to deal with), we are currently 12% slower.

I think the benchmark isn't one we should care about; instead maybe OptionT[Eval, ?] would be a short-circuiting StackSafeMonad we could implement (although I don't think OptionT or EitherT preserve stack safety currently).

As to why this benchmark is currently slower: as the benchmark is written, in the new PR we fully iterate over the entire Vector building a chain of map2, which in Eval is just implemented the naive way with flatMap. So we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in halves recursively. Importantly, this process does not iterate over the Vector; it just splits the index range, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

So, put that way, I think this means that for Vector, short-circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch memory that is not in cache; we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring a cache miss) when we are sure we will require the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. That seems like a quick change, and then I would be happy to merge.
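For future readers, the index-splitting strategy described above looks roughly like this (a simplified sketch specialized to Eval, not the actual cats implementation):

import cats.Eval

// Split the index range in half recursively; Eval.defer suspends the right
// half, so a short-circuiting evaluation never touches those indices.
def traverseTreeish[A, B](fa: Vector[A])(f: A => Eval[B]): Eval[Vector[B]] = {
  def go(start: Int, size: Int): Eval[Vector[B]] =
    if (size == 0) Eval.now(Vector.empty)
    else if (size == 1) f(fa(start)).map(Vector(_))
    else {
      val leftSize = size / 2
      go(start, leftSize).flatMap { left =>
        Eval.defer(go(start + leftSize, size - leftSize)).map(left ++ _)
      }
    }
  go(0, fa.length)
}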

@johnynek (Contributor)

btw: I was the person who implemented the tree style traverse in cats (having noticed it by starting with the TreeList datastructure I added to cats-collections: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/TreeList.scala)...

I now wonder if the implementation of two strategies in traverseViaChain is actually helping... I can't see why at the moment... There is a tree portion and then a linear portion; the linear portion is currently set at 128 wide. The only motivation I can see for the linear portion is that it may be cheaper to build a List than a Chain, but I don't recall if I benchmarked that.

For any future readers, benchmarking that code with a different width setting (say 32, 64, 128 (current), and 256) might be interesting, along with a version that is a tree all the way down (similar to the implementation of traverse_ in Vector).

If it is very close, it would be worth simplifying the code to only use the binary index tree algorithm for sure.

@TimWSpence (Member Author)

So my read based on: #4498 (comment)

is that for a benchmark on Vector with a stack-safe monad (Eval) that short-circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect Eval isn't designed to deal with), we are currently 12% slower.

I think the benchmark isn't one we should care about; instead maybe OptionT[Eval, ?] would be a short-circuiting StackSafeMonad we could implement (although I don't think OptionT or EitherT preserve stack safety currently).

As to why this benchmark is currently slower: as the benchmark is written, in the new PR we fully iterate over the entire Vector building a chain of map2, which in Eval is just implemented the naive way with flatMap. So we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in halves recursively. Importantly, this process does not iterate over the Vector; it just splits the index range, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

So, put that way, I think this means that for Vector, short-circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch memory that is not in cache; we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring a cache miss) when we are sure we will require the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. That seems like a quick change, and then I would be happy to merge.

I can absolutely do that. Just for my own benefit, I'm curious to understand the preference for keeping the performance in the exception-throwing case while sacrificing the improvement in the happy-path case?

@johnynek (Contributor) commented May 1, 2024

Actually, sorry: I misread the plots. I was looking at "this PR vs traverseDirectlyVector", I think (the second set of metrics), but the relevant set is "this PR vs main", which was the first set of metrics.

Okay, yeah, it seems there are a lot of wins. The success case is about as much faster as the error case is slower, and since success should be more common than failure, it seems like we should merge.

We can merge as is. IMO.

Shall I click merge? Do we need to wait for anyone else?

@TimWSpence (Member Author) commented May 1, 2024

Actually, sorry: I misread the plots. I was looking at "this PR vs traverseDirectlyVector", I think (the second set of metrics), but the relevant set is "this PR vs main", which was the first set of metrics.

Okay, yeah, it seems there are a lot of wins. The success case is about as much faster as the error case is slower, and since success should be more common than failure, it seems like we should merge.

We can merge as is. IMO.

Shall I click merge? Do we need to wait for anyone else?

Thank you! I'm happy to merge 😆 I think @valencik was as well? Also sorry that there are so many versions of the benchmarks - it's gotten quite confusing

@valencik (Member) commented May 1, 2024

Thank you! I'm happy to merge 😆 I think @valencik was as well? Also sorry that there are so many versions of the benchmarks - it's gotten quite confusing

LET'S DO IT!! 🎉 🚀 🐱

@johnynek merged commit ffb1df6 into typelevel:main May 1, 2024 (16 checks passed)
@TimWSpence deleted the optimize-traverse branch May 2, 2024 08:12